Using Weighted Oriented Optical Flow Histograms for Multimodal Speaker Diarization
Abstract
Speaker diarization currently focuses on using audio features to partition an audio stream into speaker-homogeneous speech regions, in other words, to determine “who spoke when”. Recent speaker diarization corpora contain video recordings in addition to the commonly used audio. Thus, we investigated the benefits of incorporating video features, namely histograms of weighted oriented optical flow, into a state-of-the-art diarization system. We found that although the video features did not perform well alone, combining systems that use audio and video features improved the diarization error rate by 14% compared to a speaker diarization system trained on audio-only features.
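The abstract names the video feature but does not say how it is computed. As a rough illustration only, a magnitude-weighted histogram of flow orientations could be sketched as follows. The function name `weighted_flow_histogram`, the bin count `n_bins`, the choice of flow magnitude as the weight, and the sum-to-one normalization are all assumptions made for this sketch, not details taken from the paper. The flow field is assumed to be precomputed by some optical flow algorithm.

```python
import numpy as np


def weighted_flow_histogram(flow, n_bins=8):
    """Magnitude-weighted histogram of optical-flow orientations (illustrative sketch).

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    Returns a length-n_bins vector that sums to 1, or all zeros if there is no motion.
    """
    dx = flow[..., 0].ravel()
    dy = flow[..., 1].ravel()
    magnitude = np.hypot(dx, dy)              # weight of each pixel's vote
    angle = np.arctan2(dy, dx) % (2 * np.pi)  # orientation mapped to [0, 2*pi)
    hist, _ = np.histogram(
        angle, bins=n_bins, range=(0.0, 2 * np.pi), weights=magnitude
    )
    total = hist.sum()
    return hist / total if total > 0 else hist


if __name__ == "__main__":
    # Hypothetical 4x4 flow field in which every pixel moves up and to the right.
    demo_flow = np.ones((4, 4, 2))
    print(weighted_flow_histogram(demo_flow, n_bins=4))
```

Weighting each orientation vote by its magnitude lets large motions dominate the histogram, while near-static pixels with noisy directions contribute little.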
Similar resources
Multimodal speaker diarization using oriented optical flow histograms
Speaker diarization is the task of partitioning an input stream into speaker-homogeneous regions, or in other words, determining “who spoke when.” While approaches to this problem have traditionally relied entirely on the audio stream, the availability of accompanying video streams in recent diarization corpora has prompted the study of methods based on multimodal audio-visual features. In thi...
Tokyo Tech at MediaEval 2016 Multimodal Person Discovery in Broadcast TV task
This paper describes our diarization system for the Multimodal Person Discovery in Broadcast TV task of the MediaEval 2016 Benchmark evaluation campaign [1]. The goal of this task is to name speakers who appear and speak simultaneously in the video, without prior knowledge. Our diarization system relies on a face diarization approach. We extract deep features from a face every 0.5 secon...
Multimodal Speaker Diarization Utilizing Face Clustering Information
Multimodal clustering/diarization tries to answer the question “who spoke when” by using audio and visual information. Diarization consists of two steps: first, segmentation of the audio information and detection of the speech segments, and then clustering of the speech segments to group the speakers. This task has been mainly studied on audiovisual data from meetings, news broadcasts or talk ...
Integer linear programming for speaker diarization and cross-modal identification in TV broadcast
Most state-of-the-art approaches address speaker diarization as a hierarchical agglomerative clustering problem in the audio domain. In this paper, we propose to revisit one of them: speech turns clustering based on the Bayesian Information Criterion (a.k.a. BIC clustering). First, we show how to model it as an integer linear programming (ILP) problem. Its resolution leads to the same overall d...
Speaker Diarization Using a priori Acoustic Information
Speaker diarization is usually performed in a blind manner without using a priori knowledge about the identity or acoustic characteristics of the participating speakers. In this paper we propose a novel framework for incorporating available a priori knowledge such as potential participating speakers, channels, background noise and gender, and integrating these knowledge sources into blind speak...
Publication date: 2008